Adaptive Federated Distillation for Multi-Domain Non-IID Textual Data
The widespread success of pre-trained language models (PLMs) has established a new training paradigm, in which a global PLM is fine-tuned using task-specific data from local clients. The local data differ greatly from one another and cannot capture the global distribution of the whole data in the real world. To address the challenges of non-IID data in real environments, privacy-preserving federated distillation has been proposed and extensively investigated. However, previous experimental non-IID scenarios are primarily characterized by label (output) diversity, without considering the diversity of language domains (input), which is crucial in natural language processing. In this paper, we introduce a comprehensive set of multi-domain non-IID scenarios and propose a unified benchmarking framework that includes diverse data. The benchmark can be used to evaluate federated learning frameworks in a real environment. To this end, we propose an Adaptive Federated Distillation (AdaFD) framework designed to address multi-domain non-IID challenges in both homogeneous and heterogeneous settings. Experimental results demonstrate that our models capture the diversity of local clients and achieve better performance than existing works. The code for this paper is available at: https://github.com/jiahaoxiao1228/AdaFD.
A Privacy-Preserving Indoor Localization System based on Hierarchical Federated Learning
Jan, Masood, Njima, Wafa, Zhang, Xun
Location information serves as a fundamental element for numerous Internet of Things (IoT) applications. Traditional indoor localization techniques often produce significant errors and raise privacy concerns due to centralized data collection. In response, Machine Learning (ML) techniques offer promising solutions by capturing indoor environment variations. However, they typically require central data aggregation, leading to privacy, bandwidth, and server reliability issues. To overcome these challenges, in this paper, we propose a Federated Learning (FL)-based approach for dynamic indoor localization using a Deep Neural Network (DNN) model. Experimental results show that FL achieves performance close to that of the Centralized Model (CL) while preserving data privacy, bandwidth efficiency, and server reliability. This research demonstrates that our proposed FL approach provides a viable solution for privacy-enhanced indoor localization, paving the way for advancements in secure and efficient indoor localization systems.
A Novel Pearson Correlation-Based Merging Algorithm for Robust Distributed Machine Learning with Heterogeneous Data
Rahmat, Mohammad Ghabel, Khalilian, Majid
Federated learning faces significant challenges in scenarios with heterogeneous data distributions and adverse network conditions, such as delays, packet loss, and data poisoning attacks. This paper proposes a novel method based on the SCAFFOLD algorithm to improve the quality of local updates and enhance the robustness of the global model. The key idea is to form intermediary nodes by merging local models with high similarity, using the Pearson correlation coefficient as a similarity measure. The proposed merging algorithm reduces the number of local nodes while maintaining the accuracy of the global model, effectively addressing communication overhead and bandwidth consumption. Experimental results on the MNIST dataset under simulated federated learning scenarios demonstrate the method's effectiveness. After 10 rounds of training using a CNN model, the proposed approach achieved accuracies of 0.82, 0.73, and 0.66 under normal conditions, packet loss and data poisoning attacks, respectively, outperforming the baseline SCAFFOLD algorithm. These results highlight the potential of the proposed method to improve efficiency and resilience in federated learning systems.
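The merging idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the greedy grouping rule, the `threshold` value, the plain averaging of grouped models, and the function names `pearson` and `merge_similar` are all assumptions made for illustration, and the SCAFFOLD control variates are omitted entirely.

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length weight vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    std_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    std_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    if std_u == 0 or std_v == 0:
        return 0.0  # correlation is undefined for a constant vector; treat as dissimilar
    return cov / (std_u * std_v)

def merge_similar(models, threshold=0.9):
    """Greedily group flattened models by correlation with each group's first
    member, then average every group into one intermediary model.
    (Hypothetical grouping rule, chosen for simplicity.)"""
    groups = []
    for weights in models:
        for group in groups:
            if pearson(group[0], weights) >= threshold:
                group.append(weights)
                break
        else:
            groups.append([weights])
    return [[sum(col) / len(group) for col in zip(*group)] for group in groups]

# Toy flattened weight vectors for three clients (made-up numbers).
local_models = [
    [0.10, 0.50, -0.30, 0.80],
    [0.12, 0.48, -0.28, 0.79],   # nearly identical to the first client
    [-0.70, 0.20, 0.90, -0.40],  # a dissimilar client
]
merged = merge_similar(local_models, threshold=0.9)
print(len(merged), "intermediary models from", len(local_models), "local models")
```

In this toy run the first two clients collapse into one intermediary node, so the server would receive two models instead of three, which is the source of the communication saving the abstract describes.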
Distributed Differentially Private Data Analytics via Secure Sketching
Burkhardt, Jakob, Keller, Hannah, Orlandi, Claudio, Schwiegelshohn, Chris
We explore the use of distributed differentially private computations across multiple servers, balancing the tradeoff between the error introduced by the differentially private mechanism and the computational efficiency of the resulting distributed algorithm. We introduce the linear-transformation model, where clients have access to a trusted platform capable of applying a public matrix to their inputs. Such computations can be securely distributed across multiple servers using simple and efficient secure multiparty computation techniques. The linear-transformation model serves as an intermediate model between the highly expressive central model and the minimal local model. In the central model, clients have access to a trusted platform capable of applying any function to their inputs. However, this expressiveness comes at a cost, as it is often expensive to distribute such computations, leading to the central model typically being implemented by a single trusted server. In contrast, the local model assumes no trusted platform, which forces clients to add significant noise to their data. The linear-transformation model avoids the single point of failure for privacy present in the central model, while also mitigating the high noise required in the local model. We demonstrate that linear transformations are very useful for differential privacy, allowing for the computation of linear sketches of input data. These sketches largely preserve utility for tasks such as private low-rank approximation and private ridge regression, while introducing only minimal error, critically independent of the number of clients. Previously, such accuracy had only been achieved in the more expressive central model.
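To see why applying a public matrix distributes cheaply across servers, consider a minimal sketch based on additive secret sharing. This is one possible instantiation and not necessarily the protocol used in the paper: the modulus `P`, the toy matrix `A`, the client inputs, and the helper names are illustrative assumptions, and the differentially private noise is deliberately left out.

```python
import random

P = 2**61 - 1  # prime modulus for additive sharing (illustrative choice)

def share(x, n_servers, rng):
    """Split an integer vector into n_servers (>= 2) additive shares modulo P."""
    shares = [[rng.randrange(P) for _ in x] for _ in range(n_servers - 1)]
    last = [(xi - sum(col)) % P for xi, col in zip(x, zip(*shares))]
    return shares + [last]

def apply_matrix(A, v):
    """Apply the public matrix A to vector v modulo P."""
    return [sum(a * b for a, b in zip(row, v)) % P for row in A]

def add(u, v):
    """Element-wise vector addition modulo P."""
    return [(a + b) % P for a, b in zip(u, v)]

rng = random.Random(0)
A = [[1, 0, 1, 1],
     [0, 1, 1, 0]]                       # public 2 x 4 sketching matrix (toy example)
clients = [[3, 1, 4, 1], [5, 9, 2, 6], [5, 3, 5, 8]]
n_servers = 3

# Each server accumulates the shares it receives; a single share looks uniformly random.
server_sums = [[0] * 4 for _ in range(n_servers)]
for x in clients:
    for s, sh in enumerate(share(x, n_servers, rng)):
        server_sums[s] = add(server_sums[s], sh)

# Each server applies the public matrix locally to its accumulated shares.
server_out = [apply_matrix(A, acc) for acc in server_sums]

# Recombining the server outputs yields A applied to the sum of all client inputs.
sketch = [0, 0]
for out in server_out:
    sketch = add(sketch, out)

total = [sum(col) for col in zip(*clients)]
print(sketch, apply_matrix(A, total))
```

Because the matrix is linear, applying it to shares and then recombining gives the same result as applying it to the summed inputs, so no single server ever sees a client's raw vector.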
Differentially Private Reward Estimation with Preference Feedback
Chowdhury, Sayak Ray, Zhou, Xingyu, Natarajan, Nagarajan
Learning from preference-based feedback has recently gained considerable traction as a promising approach to align generative models with human interests. Instead of relying on numerical rewards, the generative models are trained using reinforcement learning with human feedback (RLHF). These approaches first solicit feedback from human labelers, typically in the form of pairwise comparisons between two possible actions, then estimate a reward model using these comparisons, and finally employ a policy based on the estimated reward model. An adversarial attack at any step of the above pipeline might reveal private and sensitive information of human labelers. In this work, we adopt the notion of label differential privacy (DP) and focus on the problem of reward estimation from preference-based feedback while protecting the privacy of each individual labeler. Specifically, we consider the parametric Bradley-Terry-Luce (BTL) model for such pairwise comparison feedback involving a latent reward parameter $\theta^* \in \mathbb{R}^d$. Within a standard minimax estimation framework, we provide tight upper and lower bounds on the error in estimating $\theta^*$ under both local and central models of DP. We show, for a given privacy budget $\epsilon$ and number of samples $n$, that the additional cost to ensure label-DP under the local model is $\Theta \big(\frac{1}{ e^\epsilon-1}\sqrt{\frac{d}{n}}\big)$, while it is $\Theta\big(\frac{\text{poly}(d)}{\epsilon n} \big)$ under the weaker central model. We perform simulations on synthetic data that corroborate these theoretical results.
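A small sketch can make the local label-DP setting concrete. It assumes a linear reward $\langle \theta, \phi \rangle$ over feature vectors and uses randomized response as the local mechanism for binary preference labels; both are standard choices but are assumptions here, since the abstract does not state the paper's mechanism. The function names are hypothetical.

```python
import math
import random

def btl_prob(theta, feat_a, feat_b):
    """Probability that action a is preferred over action b under a BTL model
    with an assumed linear reward <theta, feature>."""
    diff = sum(t * (fa - fb) for t, fa, fb in zip(theta, feat_a, feat_b))
    return 1.0 / (1.0 + math.exp(-diff))

def randomized_response(label, epsilon, rng):
    """Keep a binary label with probability e^eps / (1 + e^eps), otherwise flip it."""
    keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return label if rng.random() < keep else 1 - label

def debias(noisy_mean, epsilon):
    """Recover an unbiased estimate of the true label mean from the mean of
    randomized labels. The divisor equals (e^eps - 1) / (e^eps + 1)."""
    keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return (noisy_mean - (1.0 - keep)) / (2.0 * keep - 1.0)

rng = random.Random(0)
theta = [1.0, -0.5]
p = btl_prob(theta, [1.0, 0.0], [0.0, 1.0])   # true preference probability
labels = [1 if rng.random() < p else 0 for _ in range(20000)]
noisy = [randomized_response(y, 1.0, rng) for y in labels]
estimate = debias(sum(noisy) / len(noisy), 1.0)
print(round(p, 3), round(estimate, 3))
```

The debiasing step divides by $(e^\epsilon - 1)/(e^\epsilon + 1)$, which inflates the estimation error as $\epsilon$ shrinks. This is consistent in spirit with the $\frac{1}{e^\epsilon - 1}$ factor in the local-model bound, although the sketch does not reproduce the paper's analysis.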
zPROBE: Zero Peek Robustness Checks for Federated Learning
Ghodsi, Zahra, Javaheripi, Mojan, Sheybani, Nojan, Zhang, Xinqiao, Huang, Ke, Koushanfar, Farinaz
Privacy-preserving federated learning allows multiple users to jointly train a model with the coordination of a central server. The server only learns the final aggregation result, so the users' (private) training data is not leaked from the individual model updates. However, keeping the individual updates private allows malicious users to perform Byzantine attacks and degrade the accuracy without being detected. The best existing defenses against Byzantine workers rely on robust rank-based statistics, e.g., the median, to find malicious updates. However, implementing privacy-preserving rank-based statistics is nontrivial and not scalable in the secure domain, as it requires sorting all individual updates. We establish the first private robustness check that uses high break point rank-based statistics on aggregated model updates. By exploiting randomized clustering, we significantly improve the scalability of our defense without compromising privacy. We leverage our statistical bounds in zero-knowledge proofs to detect and remove malicious updates without revealing the private user updates. Our novel framework, zPROBE, enables Byzantine-resilient and secure federated learning. Empirical evaluations demonstrate that zPROBE provides a low-overhead solution to defend against state-of-the-art Byzantine attacks while preserving privacy.
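The combination of randomized clustering and a median over aggregated updates can be sketched in plaintext as follows. This is only an illustration of the statistical idea: the secure aggregation and zero-knowledge proofs that make the real check private are omitted, and the fixed `threshold`, the per-coordinate test, and the function names are assumptions not taken from the paper.

```python
import random
from statistics import median

def cluster_means(updates, n_clusters, rng):
    """Randomly partition users into clusters and return each cluster's mean update.
    Assumes len(updates) >= n_clusters so that no cluster is empty."""
    idx = list(range(len(updates)))
    rng.shuffle(idx)
    clusters = [idx[i::n_clusters] for i in range(n_clusters)]
    means = []
    for members in clusters:
        cols = zip(*(updates[i] for i in members))
        means.append([sum(c) / len(members) for c in cols])
    return means

def robust_filter(updates, n_clusters, threshold, rng):
    """Take the coordinate-wise median of cluster means as a robust center and
    keep only updates within `threshold` of it in every coordinate."""
    means = cluster_means(updates, n_clusters, rng)
    center = [median(c) for c in zip(*means)]
    kept = [u for u in updates
            if all(abs(a - m) <= threshold for a, m in zip(u, center))]
    return center, kept

rng = random.Random(0)
honest = [[1.0 + 0.01 * i, -1.0 - 0.01 * i] for i in range(9)]
malicious = [[50.0, 50.0]]               # a single Byzantine update
center, kept = robust_filter(honest + malicious, n_clusters=5, threshold=0.5, rng=rng)
print(len(kept), "of", len(honest) + len(malicious), "updates kept")
```

Only cluster-level aggregates feed the median, so in a private deployment the server would never need to sort individual updates, which is the scalability point made in the abstract.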
Collaborative Domain Blocking: Using federated NLP To Detect Malicious Domains
Current content filtering and blocking methods are susceptible to various circumvention techniques and are relatively slow in dealing with new threats. This is because these methods use shallow pattern recognition based on regular expression rules found in crowdsourced block lists. We propose a novel system that aims to remedy these issues by examining deep textual patterns of network-oriented content relating to the domain being interacted with. Moreover, we propose to use federated learning, which allows users to take advantage of each other's localized knowledge and experience regarding what should or should not be blocked on a network without compromising privacy. Our experiments show the promise of our proposed approach in real-world settings. We also provide data-driven recommendations on how best to implement the proposed system.
Leveraging Federated Learning for Medical Imaging
If you follow Artificial Intelligence and Machine Learning, you've probably heard a lot about the applications of these emerging technologies in the healthcare space. This entire hype train stems from the raw processing power that machine learning models are able to employ. Recently, Machine Learning models have been used to build statistical models from medical data. These models have had various use cases; for example, AI has been used to assist radiologists in performing diagnoses using Convolutional Neural Networks. Despite this, there has been an evident hesitation to exploit medical data to develop efficient and effective machine learning models.
Low-Latency Cooperative Spectrum Sensing via Truncated Vertical Federated Learning
Zhang, Zezhong, Zhu, Guangxu, Cui, Shuguang
In recent years, the exponential increase in the demand for wireless data transmission has raised the urgency of accurate spectrum sensing approaches to improve spectrum efficiency. The unreliability of conventional spectrum sensing methods that use measurements from a single secondary user (SU) has motivated research on cooperative spectrum sensing (CSS). In this work, we propose a vertical federated learning (VFL) framework to exploit the distributed features across multiple SUs without compromising data privacy. However, the repetitive training process in VFL faces the issue of high communication latency. To accelerate the training process, we propose a truncated vertical federated learning (T-VFL) algorithm, in which the training latency is greatly reduced by integrating the standard VFL algorithm with a channel-aware user scheduling policy. The convergence performance of T-VFL is established via mathematical analysis and validated by simulation results. Moreover, to guarantee the convergence performance of the T-VFL algorithm, we derive three design rules for the neural architectures used under the VFL framework, whose effectiveness is demonstrated through simulations.
Remote Data Science Part 2: Introduction to PySyft and PyGrid
This post is a continuation of "Remote Data Science Part 1: Today's privacy challenges in BigData". The previous post discusses the importance of understanding privacy challenges in BigData and explains how "Remote Data Science" enables three privacy guarantees for the data scientist and the data owner. This post explains the different components of Remote Data Science. It also introduces "Model-centric FL" and "Data-centric FL", both of which are deployable in the Remote Data Science architecture. PyGrid is a peer-to-peer network of data curators/owners and data scientists who can collectively train AI models using PySyft on decentralised data (data never leaves the device).